Abstract

Link prediction is widely employed to uncover the missing links in
snapshots of real-world networks, which are usually obtained through
various sampling methods. In the previous literature, however, in order to
evaluate the performance of a prediction, the known edges in the sampled
snapshot are divided randomly into the training set and the probe set,
without considering the underlying sampling approach. Different sampling
methods might lead to different missing links, especially the biased ones.
For this reason, evaluation based on random partition is no longer
convincing once the sampling method is taken into account. In this paper,
aiming at filling this void, we reevaluate the performance of local
information based link predictions by letting the sampling method govern
the division of the training set and the probe set. Interestingly, we find
that each prediction approach performs unevenly across different sampling
methods. Moreover, most of these predictions perform weakly when the
sampling method is biased, which indicates that the performance of these
methods is overestimated in the prior works.

Keywords: Link prediction, Sampling, Complex networks, Performance
evaluation



1. Introduction 

Complex networks have been widely used to characterize the real world.
For instance, the Internet can be treated as a network constituted by
routers and the physical links among them. As for Facebook, the users can
be regarded as nodes and the online friendships as links, which also
compose a network. Hence, the complex network is a powerful tool to
represent objects and their relations. Moreover, in the real world, the size
of the network might be extremely large. Taking Facebook as an example,
it currently contains nearly 600 million users^1. Because of the large scale,
it is very hard for the research community to get a complete and rich
picture of the network. In addition, even for some small networks, it is
difficult to observe some links in experiments [1]. For the above reasons,
much research interest has been devoted to the problem of link prediction
in recent years. Based on simple local information or global evolving rules,
link prediction can uncover the missing links in an incomplete network and
even predict the future links that would be generated later.

In the previous literature, in order to validate the performance of
prediction methods, the edges of the known network are usually divided
randomly into the training set and the probe set. However, in the real
world, sampling a large-scale network is often not purely random but
biased. An intuitive example is Breadth-First-Search (BFS) sampling, which
is commonly employed to crawl online social networks [2], [3]. It has been
revealed that BFS sampling is not random but biased towards nodes with
higher degrees [4], which means it would only extract a dense region of the
network without reaching the other parts. A natural question then arises:
is it reasonable to obtain the training set and the probe set through
random selection, particularly for a snapshot sampled from the network by
a biased method? Intuitively, for a snapshot obtained from a biased
sampling method, the previous evaluation of each link prediction method is
not convincing because it only corresponds to random sampling. Hence, it
becomes difficult to select a proper prediction method for a certain data
set if we rely only on the precision validation obtained from a randomly
selected probe set. In fact, recent work [5] has also pointed out this
problem and given an excellent illustration by selecting the probe set
based on the edge's popularity. However, how sampling methods affect the
performance of the existing link prediction approaches has not yet been
investigated completely. In order to fill this gap, in the present work, we
try to reveal the interaction between sampling methods and prediction
approaches.

We employ five widely used sampling methods to generate the training
set and the probe set from nine real-world networks of different contexts.
For the sake of practical usage, we only consider the local information
based link prediction methods in the present paper. By comparing the
performance, we find that each of the ten prediction measures performs
unevenly under different sampling approaches. Besides, these measures
perform poorly as compared to the case of pure random selection of the
probe set, which indicates that their performance might have been
overestimated conventionally.

^1 http://en.wikipedia.org/wiki/Facebook

The rest of the paper is organized as follows. In Section 2 we present
the recent related works. The local information based link prediction
and the sampling methods are illustrated in Section 3. In Section 4 the
data sets used in the experiments are depicted. The observations and
remarks of the performance reevaluation in the view of sampling methods
are then introduced in Section 5. Finally, in Section 6 we conclude this
work briefly.

2. Related Works 

Recent years have witnessed growing interest in the link prediction of
complex networks. It aims to evaluate the likelihood of a link between two
nodes that are not currently connected, based on the information of the
existing links [6]. The existing methods for link prediction can be divided
into three categories [7]. The first kind defines a measure of proximity or
similarity between two nodes in the network, on the assumption that links
between more similar nodes are more likely to exist. Liben-Nowell and
Kleinberg summarize many similarity measures based on node neighborhoods,
the ensemble of all paths and higher-level approaches [8]. Motivated by the
resource allocation process taking place in networks, Zhou et al. review
the existing similarity measures and propose a new one, which performs well
in several representative networks drawn from different fields [9]. In the
present work, we mainly focus on this kind of methods for their simplicity
and efficiency. The other two kinds of methods are based on maximum
likelihood estimation [10] and machine learning techniques [11]. However,
in the prior works, the probe set is generally determined by random edge
selection. In the recent work [5], the authors argue that this conventional
way of evaluation may lead to severe bias and then study how to uncover
missing links with low-degree nodes.

With respect to growing networks and their tremendously large scales,
many different sampling methods have been presented in recent years. The
aim of a sampling method is to derive a representative snapshot of the
network at low cost. The simplest one, BFS, is usually employed to crawl
online social networking sites and collect online social networks [2], [3].
Because of the bias introduced by BFS, Gjoka et al. consider the
Metropolis-Hastings Random Walk, which was first presented to sample
peer-to-peer networks unbiasedly [12], in crawling Facebook, and achieve
the goal of a uniform stationary distribution of nodes [4]. Leskovec et al.
review many sampling methods and present a new approach named Forest-Fire
(FF), which matches both static and evolutionary graph patterns very
accurately while the sample size decreases down to about 15% of the
original graph [13]. Regarding the drawbacks of random walks, a
multidimensional random walk, named Frontier Sampling, is presented in
[14], and the authors find that this approach is more suitable to sample
the tail of the degree distribution of the graph. In recent work [15], a
new method of estimating the size of the original network is depicted,
while in [16] the relation between sampling methods and information
diffusion in social media is also discussed. However, to the best of our
knowledge, little attention has been paid to employing these methods to
evaluate the performance of link prediction in complex networks.

It is worth noting that the recent work [5] has pointed out the problem
induced by selecting edges into the probe set randomly. It also gives a
simple but clear illustration of this problem by use of the edge popularity.
However, the interplay between sampling methods and link prediction
approaches is still not well investigated. Hence, we try to fill this void
in the present work.

3. Preliminaries 

In this section, we first give the basic definitions about the network.
The local information based prediction methods and the sampling approaches
employed later are then introduced, respectively.

3.1. Definitions 

In this paper, all the data sets we use can be denoted as an undirected
graph $G(V, E)$, where $V$ is the set of objects (nodes) and $E$ is the set
of relationships (links) among these objects. For each node $i$, the number
of links connected to it is defined as its degree $k_i$; the average degree
of $G$ is then

$$\langle k \rangle = \frac{1}{|V|} \sum_{i \in V} k_i.$$

The set of nodes connected to $i$ is denoted as $n(i)$, i.e., $i$'s
neighbors. The heterogeneity of the network, defined as

$$H = \frac{\langle k^2 \rangle}{\langle k \rangle^2}, \qquad (1)$$

where $\langle k^2 \rangle$ is the mean of the squared degrees, is usually
used to characterize the nonuniformity of degrees [9]. The clustering
coefficient of a node $i$ is used to characterize how closely its neighbors
are connected. It can be defined as

$$C_i = \frac{2|E_i|}{k_i (k_i - 1)}, \qquad (2)$$

where $E_i$ is the set of ties between $i$'s neighbors and $k_i$ is the
degree of $i$. For the case of $k_i = 1$, we set $C_i = 0$ in this paper.
The average clustering coefficient of the network can be defined as

$$C = \frac{1}{|V|} \sum_{i \in V} C_i. \qquad (3)$$

For an undirected edge $e_{ij}$ between $i$ and $j$, we define its
popularity [5] as

$$e_{pop}(i, j) = (k_i - 1)(k_j - 1). \qquad (4)$$

Similarly, we define the number of common neighbors of its two ends as

$$e_{CN}(i, j) = |n(i) \cap n(j)|. \qquad (5)$$
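
The quantities defined in this subsection can be computed directly from an adjacency structure. The following is a minimal sketch, assuming the graph is stored as a dictionary that maps each node to the set of its neighbors; the function names and the toy graph are illustrative choices and not part of the original work.

```python
from itertools import combinations


def average_degree(adj):
    """Average degree <k> of the graph."""
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def heterogeneity(adj):
    """H = <k^2> / <k>^2, Eq. (1)."""
    n = len(adj)
    k1 = sum(len(nbrs) for nbrs in adj.values()) / n
    k2 = sum(len(nbrs) ** 2 for nbrs in adj.values()) / n
    return k2 / k1 ** 2


def clustering(adj, i):
    """C_i of Eq. (2); nodes with fewer than two neighbors get 0."""
    k = len(adj[i])
    if k < 2:
        return 0.0
    ties = sum(1 for u, v in combinations(adj[i], 2) if v in adj[u])
    return 2.0 * ties / (k * (k - 1))


def average_clustering(adj):
    """C of Eq. (3)."""
    return sum(clustering(adj, i) for i in adj) / len(adj)


def popularity(adj, i, j):
    """Edge popularity of Eq. (4)."""
    return (len(adj[i]) - 1) * (len(adj[j]) - 1)


def common_neighbors(adj, i, j):
    """Number of common neighbors of Eq. (5)."""
    return len(adj[i] & adj[j])


# Toy graph: a triangle 0-1-2 with a pendant node 3 attached to node 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(average_degree(toy), heterogeneity(toy), average_clustering(toy))
print(popularity(toy, 0, 2), common_neighbors(toy, 0, 1))
```
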

3.2. Local information based link prediction 

Many link prediction methods have been presented recently. For practical
feasibility, we only investigate the local information based ones for
their simplicity and low cost. In fact, for an arbitrary prediction method,
the essence of its algorithm is to score each node pair $(i, j)$, where
$i, j \in V$ are not connected by an edge in $E$. After assigning a value
$s(i, j)$ to each pair $(i, j)$, we only need to sort the pairs in
decreasing order of score and choose the ones with higher values as the
predictions. We mainly utilize ten popular measures of $s(i, j)$ in this
paper, all of which are introduced in detail as follows.

Common Neighbors (CN) It is assumed that two nodes with more common
neighbors are more likely to be connected. The score of this method can
then be defined intuitively as

$$s^{CN}(i, j) = |n(i) \cap n(j)|. \qquad (6)$$

Adamic-Adar Index (AA) This method tries to assign more weight to the
common neighbors with lower degrees [17]; hence, the score can be denoted as

$$s^{AA}(i, j) = \sum_{q \in n(i) \cap n(j)} \frac{1}{\log k_q}. \qquad (7)$$

Resource Allocation (RA) In this measure, the score between $i$ and $j$ is
defined as the amount of resource that $j$ receives from $i$ [9]. The score
can then be denoted as

$$s^{RA}(i, j) = \sum_{q \in n(i) \cap n(j)} \frac{1}{k_q}. \qquad (8)$$



Salton Index (SAI) In this measure, the score is defined as

$$s^{SAI}(i, j) = \frac{|n(i) \cap n(j)|}{\sqrt{k_i k_j}}. \qquad (9)$$

Jaccard Index (JI) The score of each pair can also be obtained from
Jaccard's definition as

$$s^{JI}(i, j) = \frac{|n(i) \cap n(j)|}{|n(i) \cup n(j)|}. \qquad (10)$$

Sørensen Index (SPI) This measure was presented for ecological community
data sets [9]; the score is defined as

$$s^{SPI}(i, j) = \frac{2|n(i) \cap n(j)|}{k_i + k_j}. \qquad (11)$$

Hub Promoted Index (HPI) The score in this method is defined as [18]

$$s^{HPI}(i, j) = \frac{|n(i) \cap n(j)|}{\min\{k_i, k_j\}}. \qquad (12)$$

It can be seen easily that the links connected to the nodes with higher
degrees (hubs) would be allocated higher values [9].

Hub Depressed Index (HDI) Different from HPI, in this method the score is
defined as

$$s^{HDI}(i, j) = \frac{|n(i) \cap n(j)|}{\max\{k_i, k_j\}}, \qquad (13)$$

so as to decrease the values allocated to the links connected to the hubs.



Leicht-Holme-Newman Index (LHN) This measure is presented in [19] and
it defines the score as

$$s^{LHN}(i, j) = \frac{|n(i) \cap n(j)|}{k_i k_j}. \qquad (14)$$

Preferential Attachment (PA) This measure is motivated by the mechanism
of preferential attachment in the evolution of scale-free networks, and
the score can be defined as [20]

$$s^{PA}(i, j) = k_i k_j. \qquad (15)$$

As stated in [9], this method requires the least information and the lowest
computational complexity among all the measures mentioned here.

In summary, all these measures are employed to score each pair of
unconnected nodes and then to predict the pairs with the highest scores as
the most plausible links.
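
The ten measures of Eqs. (6)-(15) share the same ingredients, namely the two neighbor sets and the degrees of the common neighbors. The sketch below computes all of them for one node pair, again assuming a dictionary that maps each node to its neighbor set. The function name `local_scores` and the convention of returning 0 when a denominator vanishes are assumptions made for illustration only.

```python
import math


def local_scores(adj, i, j):
    """Return the ten local similarity scores of Eqs. (6)-(15) for (i, j)."""
    ni, nj = adj[i], adj[j]
    ki, kj = len(ni), len(nj)
    common = ni & nj
    cn = len(common)

    def ratio(num, den):
        # Assumption: a score with a vanishing denominator is reported as 0.
        return num / den if den > 0 else 0.0

    return {
        "CN": cn,
        # A common neighbor of i and j has degree >= 2, so log(k_q) > 0.
        "AA": sum(1.0 / math.log(len(adj[q])) for q in common),
        "RA": sum(1.0 / len(adj[q]) for q in common),
        "SAI": ratio(cn, math.sqrt(ki * kj)),
        "JI": ratio(cn, len(ni | nj)),
        "SPI": ratio(2 * cn, ki + kj),
        "HPI": ratio(cn, min(ki, kj)),
        "HDI": ratio(cn, max(ki, kj)),
        "LHN": ratio(cn, ki * kj),
        "PA": ki * kj,
    }


# Toy graph: nodes 0 and 1 are not linked but share the neighbors 2 and 3.
toy = {0: {2, 3}, 1: {2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
scores = local_scores(toy, 0, 1)
for name, value in scores.items():
    print(name, value)
```

In an actual prediction, such scores would be computed on the training set only and the unconnected pairs would then be ranked by the chosen measure.
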

In order to evaluate the performance of these methods, a generally used
way in previous works is to divide $E$ into two non-overlapping parts:
$E^T$, called the training set, and $E^P$, called the probe set or the
testing set. Clearly we have $E = E^T \cup E^P$ and $E^T \cap E^P =
\emptyset$. In general, $E^T$ contains 90% of the edges in $E$ and the
remaining ones are allocated to $E^P$. For all the possible links not
contained in $E^T$, the prediction methods use only the information
contained in $E^T$ to score them, then sort them in decreasing order of
score and select the top $|E^P|$ ones into the prediction set $E^{\pi}$.
Hence, the precision of a prediction method $\pi$ can be defined
intuitively as

$$Precision(\pi) = \frac{|E^{\pi} \cap E^P|}{|E^P|}.$$

In addition, another widely used evaluation measure is AUC, which can be
interpreted as the probability that a randomly chosen missing link is
given a higher score than a randomly chosen nonexistent link by the
prediction method [9]. In the implementation, if among $n$ independent
comparisons there are $n'$ times in which the missing link has a higher
score and $n''$ times in which the missing link and the nonexistent link
have the same score, then the accuracy can be defined as

$$AUC = \frac{n' + 0.5 n''}{n}. \qquad (16)$$

As stated in [9], the extent to which the AUC exceeds 0.5 indicates how much
better the utilized prediction method performs than the random case. In the
later experiments, we mainly employ this measure.
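
The sampled estimate of Eq. (16) can be sketched as follows. This is an illustrative implementation under stated assumptions rather than the code used in the experiments: edges are stored as tuples `(u, v)` with `u < v`, nonexistent links are drawn by rejection sampling (which assumes the graph is not complete), and the helper names `build_adjacency` and `auc_by_sampling` are hypothetical.

```python
import random


def build_adjacency(nodes, edges):
    """Neighbor sets of the graph spanned by the given edges."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def common_neighbors(adj, i, j):
    return len(adj[i] & adj[j])


def auc_by_sampling(adj_train, probe_edges, all_edges, score, n=10000, seed=0):
    """Estimate AUC = (n' + 0.5 n'') / n from n random comparisons."""
    rng = random.Random(seed)
    nodes = sorted(adj_train)
    probe = sorted(probe_edges)
    higher = ties = 0
    for _ in range(n):
        pi, pj = rng.choice(probe)            # a random missing link
        while True:                           # a random nonexistent link
            u, v = sorted(rng.sample(nodes, 2))
            if (u, v) not in all_edges:
                break
        s_missing = score(adj_train, pi, pj)
        s_nonexistent = score(adj_train, u, v)
        if s_missing > s_nonexistent:
            higher += 1
        elif s_missing == s_nonexistent:
            ties += 1
    return (higher + 0.5 * ties) / n


# Toy example: the probe link (0, 1) has two common neighbors in the
# training graph, while every nonexistent link has exactly one.
train = {(0, 2), (1, 2), (0, 3), (1, 3), (2, 3), (2, 4)}
probe = {(0, 1)}
adj_train = build_adjacency(range(5), train)
auc_cn = auc_by_sampling(adj_train, probe, train | probe, common_neighbors, n=1000)
print(auc_cn)
```

Only the training adjacency is passed to the scoring function, which mirrors the requirement that a prediction method must not see the probe set.
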

3.3. Sampling methods 

With the growth of real-world networks, especially the emergence of
large-scale online social networks, many sampling methods have been
developed to get a representative view of the original network. In line
with the motivation of this paper, we aim to sample a fraction $s_f$
($0 < s_f < 1$) of the edges from $E$ to obtain $E^T$, and the remaining
edges then compose $E^P$. Hence, different from the conventional random
selection, the training set and the probe set are determined by a certain
sampling method. In this paper, we mainly employ five typical sampling
methods, which are depicted as follows.

Breadth First Search (BFS) For its simplicity, BFS is commonly used to
sample networks, especially for crawling the web and obtaining online
social networks. However, as stated in the former section, BFS is biased
towards nodes with higher degrees and might only extract one dense core of
the network without reaching out to its other parts. The procedure of this
method can be listed as

• Step 1: set the states of all the nodes to 0 and randomly select a
starting node $i$ from $V$.

• Step 2: add $\forall e_{ij} \in E$ to $E^T$, where $j \in n(i)$. Set $i$'s
state to 1, which means it has been sampled. Then $\forall j \in n(i)$, if
$j$ is not sampled, add it to the sampling queue $Q$, i.e.,
$Q = Q \cup \{j\}$. If $|E^T|/|E| > s_f$, then go to Step 4.

• Step 3: $\forall q \in Q$, perform Step 2 with $i$ replaced by $q$.

• Step 4: let $E^P = E - E^T$ and exit.
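
The four steps of the BFS division can be sketched as below. This is a minimal illustration assuming a connected graph given as a dictionary of neighbor sets; the function name `bfs_split` is hypothetical, and a set of already queued nodes replaces the node states so that no node is queued twice, which does not change the sampled edges.

```python
import random
from collections import deque


def edge(u, v):
    """Canonical (smaller, larger) form of an undirected edge."""
    return (u, v) if u < v else (v, u)


def bfs_split(adj, s_f, seed=0):
    """Divide the edges into (E^T, E^P) by breadth-first sampling."""
    rng = random.Random(seed)
    all_edges = {edge(u, v) for u in adj for v in adj[u]}
    train = set()
    start = rng.choice(sorted(adj))              # Step 1
    queue, seen = deque([start]), {start}
    while queue:
        i = queue.popleft()                      # Step 3: next node from Q
        for j in sorted(adj[i]):                 # Step 2: take all edges of i
            train.add(edge(i, j))
            if j not in seen:
                seen.add(j)
                queue.append(j)
        if len(train) / len(all_edges) > s_f:    # stop once s_f is exceeded
            break
    return train, all_edges - train              # Step 4


toy = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2, 4}, 4: {3, 5}, 5: {4}}
train, probe = bfs_split(toy, s_f=0.5)
print(sorted(train), sorted(probe))
```

Because all the edges of a node are added before the stopping condition is checked, the training set may slightly overshoot the target fraction.
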

Metropolis-Hastings Random Walk (MHRW) The Metropolis-Hastings algorithm
is a general Markov Chain Monte Carlo technique for sampling from a
probability distribution $\varphi$ that is difficult to sample from
directly [4]. It appropriately modifies the transition probabilities so
that the walk converges to the desired uniform distribution [4]. The
procedure of the method can be depicted as

• Step 1: randomly select a starting node $i$ from $V$.

• Step 2: randomly select the next hop $j$ from $n(i)$. Let
$p_{ij} = k_i/k_j$. Generate a uniform $p \in [0, 1]$; if
$p < \min\{1.0, p_{ij}\}$, add $e_{ij} \in E$ to $E^T$, else just let
$j = i$. If $|E^T|/|E| > s_f$, then go to Step 4.

• Step 3: repeat Step 2 with $i$ replaced by $j$.

• Step 4: let $E^P = E - E^T$ and exit.

It can be seen from Step 2 that MHRW tries to avoid the bias towards the
nodes with higher degrees.
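
A hedged sketch of the MHRW division follows, assuming a connected graph stored as a dictionary of neighbor sets. The function name `mhrw_split` and the safety bound `max_steps` are illustrative additions and do not appear in the procedure itself.

```python
import random


def mhrw_split(adj, s_f, seed=0, max_steps=1_000_000):
    """Divide the edges into (E^T, E^P) by a Metropolis-Hastings random walk."""
    rng = random.Random(seed)
    nbrs = {v: sorted(adj[v]) for v in adj}
    all_edges = {(min(u, v), max(u, v)) for u in adj for v in adj[u]}
    train = set()
    i = rng.choice(sorted(adj))                      # Step 1
    for _ in range(max_steps):                       # hypothetical safety bound
        j = rng.choice(nbrs[i])                      # Step 2: propose next hop
        p_ij = len(nbrs[i]) / len(nbrs[j])
        if rng.random() < min(1.0, p_ij):
            train.add((min(i, j), max(i, j)))        # accepted: sample the edge
            i = j                                    # Step 3: continue from j
        # rejected: the walker stays at i
        if len(train) / len(all_edges) > s_f:
            break
    return train, all_edges - train                  # Step 4


toy = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2, 4}, 4: {3, 5}, 5: {4}}
train, probe = mhrw_split(toy, s_f=0.5)
print(sorted(train), sorted(probe))
```

A move towards a node of higher degree is accepted only with probability $k_i/k_j$, which is how the walk compensates for the degree bias of a simple random walk.
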

Frontier Sampling (FS) FS is proposed to implement multidimensional
random walks in networks [14]. It performs $m$ dependent random walks in
the network, and $m$ is denoted as its dimension. The procedure of this
method can be presented as

• Step 1: randomly select $m$ nodes from $V$ and add them to the seed list
$S$.

• Step 2: select a node $i$ from $S$ with the probability
$p_i = k_i / \sum_{q \in S} k_q$. Then randomly select $j$ from $n(i)$, add
$e_{ij} \in E$ to $E^T$ and replace $i$ by $j$ in $S$. If
$|E^T|/|E| > s_f$, go to Step 3, else repeat this step.

• Step 3: let $E^P = E - E^T$ and exit.
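
The FS procedure can be illustrated with the following sketch. It assumes a connected graph given as a dictionary of neighbor sets and $m \leq |V|$; drawing the seeds without replacement, the function name `frontier_split` and the bound `max_steps` are assumptions made for this example.

```python
import random


def frontier_split(adj, s_f, m, seed=0, max_steps=1_000_000):
    """Divide the edges into (E^T, E^P) by m-dimensional Frontier Sampling."""
    rng = random.Random(seed)
    nbrs = {v: sorted(adj[v]) for v in adj}
    all_edges = {(min(u, v), max(u, v)) for u in adj for v in adj[u]}
    train = set()
    seeds = rng.sample(sorted(adj), m)               # Step 1: seed list S
    for _ in range(max_steps):                       # hypothetical safety bound
        # Step 2: pick a walker with probability proportional to its degree.
        weights = [len(nbrs[v]) for v in seeds]
        idx = rng.choices(range(m), weights=weights, k=1)[0]
        i = seeds[idx]
        j = rng.choice(nbrs[i])
        train.add((min(i, j), max(i, j)))
        seeds[idx] = j                               # replace i by j in S
        if len(train) / len(all_edges) > s_f:
            break
    return train, all_edges - train                  # Step 3


toy = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2, 4}, 4: {3, 5}, 5: {4}}
train, probe = frontier_split(toy, s_f=0.5, m=2)
print(sorted(train), sorted(probe))
```
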

Forest Fire (FF) FF is proposed in [13] for sampling large-scale networks
to empirically analyze their static or dynamic graph properties. This
method can also be used to generate networks as an evolution model [21].
Its implementation can be depicted briefly as

• Step 1: randomly select a seed node from $V$ and add it to the burn list
$B$.

• Step 2: get a node $i$ from $B$, add all the edges connected to $i$ to
$E^T$ and set it as a burned node. Then generate a random number $\beta$
that is geometrically distributed with mean

$$\frac{p_f}{1 - p_f}, \qquad (17)$$

where $p_f$ is the forward-burning probability. Select
$\min\{\beta, k_i\}$ neighbors from $n(i)$ that are not yet burned and add
them to $B$. If $|E^T|/|E| > s_f$, go to Step 3, else repeat this step.

• Step 3: let $E^P = E - E^T$ and exit.
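
The FF division can be sketched as follows, assuming a dictionary of neighbor sets, $0 < s_f < 1$ and $0 \leq p_f < 1$. The procedure does not say what happens when the burn list runs empty before the target fraction is reached; restarting from a random unburned node is an assumption of this sketch, as are the function name `forest_fire_split` and the choice to take all unburned neighbors when fewer than $\min\{\beta, k_i\}$ are left.

```python
import random
from collections import deque


def forest_fire_split(adj, s_f, p_f, seed=0):
    """Divide the edges into (E^T, E^P) by Forest Fire sampling."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    all_edges = {(min(u, v), max(u, v)) for u in adj for v in adj[u]}
    train = set()
    seen = set()           # nodes that are burned or waiting in the burn list
    burn_list = deque()
    while len(train) / len(all_edges) <= s_f:
        if not burn_list:
            # Step 1; also used (as an assumption) to restart a fire that
            # died out before the target fraction was reached.
            start = rng.choice([v for v in nodes if v not in seen])
            seen.add(start)
            burn_list.append(start)
        i = burn_list.popleft()                      # Step 2: burn node i
        for j in adj[i]:
            train.add((min(i, j), max(i, j)))
        beta = 0
        while rng.random() < p_f:                    # geometric with mean
            beta += 1                                # p_f / (1 - p_f), Eq. (17)
        fresh = [j for j in sorted(adj[i]) if j not in seen]
        rng.shuffle(fresh)
        for j in fresh[:min(beta, len(adj[i]))]:
            seen.add(j)
            burn_list.append(j)
    return train, all_edges - train                  # Step 3


toy = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2, 4}, 4: {3, 5}, 5: {4}}
train, probe = forest_fire_split(toy, s_f=0.5, p_f=0.8)
print(sorted(train), sorted(probe))
```
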

Pure Random (PR) In order to compare with the conventional random
selection of the training set and the probe set, here we introduce an ideal
pure random sampling method, which is generally not feasible in practical
scenarios. We assume that the global view of the network is available; we
then randomly select a fraction $s_f$ of the links and add them to $E^T$,
while the remaining ones belong to $E^P$.
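
For comparison with the sketches of the biased methods, the PR division reduces to a uniform draw over the edge list. The snippet below is a minimal illustration; the function name `pure_random_split` and the rounding of $s_f |E|$ to the nearest integer are assumptions.

```python
import random


def pure_random_split(adj, s_f, seed=0):
    """Divide the edges into (E^T, E^P) uniformly at random."""
    rng = random.Random(seed)
    all_edges = sorted({(min(u, v), max(u, v)) for u in adj for v in adj[u]})
    n_train = round(s_f * len(all_edges))    # assumption: nearest integer
    train = set(rng.sample(all_edges, n_train))
    return train, set(all_edges) - train


toy = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2, 4}, 4: {3, 5}, 5: {4}}
train, probe = pure_random_split(toy, s_f=0.5)
print(sorted(train), sorted(probe))
```
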

In summary, we employ these five sampling methods to evaluate the ten
prediction measures on the real-world networks, which are introduced in
the next section.

4. Real-world Data Sets 

In this section, the nine real-world data sets utilized in the present
work are depicted in detail. It is worth noting that, for the sake of
sampling, we only perform experiments on the giant connected component of
each data set.

These networks come from different fields in the real world. Netscience
is a network of co-authorships between scientists who are themselves
publishing on the topic of network science [22]. Power is a well-connected
electrical power grid of the western US, where nodes denote generators,
transformers and substations and edges denote the transmission lines
between them [23]. USAir is the network of the US air transportation
system, in which the nodes are airports while the links are the airlines
among them^2. Yeast is a network of protein-protein interactions^3. Dimes
is a topology of the Internet at the level of Autonomous Systems (AS) and
comes from the DIMES project^4. The AS-level data set we use was released
in March 2010. In this network, each node represents an AS, while each
link means there exists an AS path between the two related nodes. Pb is a
directed network of US political blogs^5. Here we treat its links as
undirected and omit self-connections. Caltech is the Facebook network whose
ties are within California Institute of Technology, in which a node is a
user and the friendships between users are the links [24]. Email covers all
the email communication within a data set of around half a million emails.
Nodes of the network are email addresses, and if an address $i$ sent at
least one email to address $j$, the graph contains an undirected edge from
$i$ to $j$ [25]. Hepph is from the e-print arXiv^6 and covers scientific
collaborations between authors with papers submitted to the High Energy
Physics-Phenomenology category from January 1993 to April 2003. If an
author $i$ co-authored a paper with author $j$, the graph contains an
undirected edge from $i$ to $j$ [21].

^2 http://vlado.fmf.uni-lj.si/pub/networks/data/mix/USAir97.net
^3 http://vlado.fmf.uni-lj.si/pub/networks/data/bio/Yeast/Yeast.htm
^4 http://www.netdimes.org
^5 http://www-personal.umich.edu/~mejn/netdata/polblogs.zip

The basic topological characteristics of these data sets are listed in
Table 1.

Table 1: Real-world data sets.

Data Set      |V|       |E|        ⟨k⟩      C       H
Netscience    379       914        4.82     0.74    1.66
Power         4 941     6 594      2.67     0.08    1.45
USAir         332       2 126      12.81    0.63    3.46
Yeast         2 224     6 609      5.94     0.14    2.80
Dimes         26 424    90 267     6.83     0.47    74.66
Pb            1 222     16 714     27.36    0.32    2.97
Caltech       762       16 651     43.70    0.41    1.72
Email         33 696    180 811    10.73    0.50    13.27
Hepph         11 204    117 619    21.00    0.62    6.23

5. Evaluation from the view of sampling 

In this section, we perform evaluation experiments on different
real-world data sets and reveal that each prediction measure performs
unevenly under different sampling methods. Finally, we also discuss the
evaluation of performance when the sampling parameters are tuned.

5.1. Typical Evaluation

For each data set, we sample 100 times and generate 100 partitions of
$E^T$ and $E^P$ with $s_f = 0.9$. Then for the 100 cases, we employ each
prediction measure to uncover the missing links and take the averaged AUC
as its performance. For FS and FF, we set $m = 100$ and $p_f = 0.8$ in the
experiments of this subsection. The other configurations of these
parameters are discussed in the next subsection.

^6 http://www.arxiv.org

As shown in Figure 1, each prediction method performs unevenly under
different sampling approaches, particularly PA. For instance, as shown in
Figure 1, the AUC of CN on Dimes is 0.73 when the probe set is obtained by
BFS, while it becomes 0.75, 0.84, 0.69 and 0.87 when the probe set is
determined by FS, MHRW, FF and PR, respectively. With respect to PA, it
performs best for PR with an AUC of 0.86; for the other sampling methods,
however, its AUC decreases, e.g., to 0.54, 0.63, 0.82 and 0.50 for BFS, FS,
MHRW and FF, respectively. Figure 1 also indicates that for most of the
data sets we employ in this work, all the prediction methods perform best
when the probe set is obtained through the conventional sampling way,
i.e., PR. This tells us that in previous work, the performance of local
information based link predictions might have been overestimated.

As mentioned in Section 3.2, most of these measures are related to
$e_{pop}(i,j)$ and $e_{CN}(i,j)$. In order to illustrate the diversity of
the performance under different sampling methods, we observe the
distributions of $e_{pop}(i,j)$ and $e_{CN}(i,j)$ in the probe set, denoted
as $P(e_{pop})$ and $P(e_{CN})$, respectively. As can be seen in Figures 2
and 3, for each randomly selected data set, $P(e_{pop})$ and $P(e_{CN})$
vary considerably across the sampling methods. Generally, for PR, FS and
MHRW, most of the edges that are not sampled occupy lower values of
$e_{pop}(i,j)$ and $e_{CN}(i,j)$, while for BFS and FF, the fraction of
edges with large $e_{pop}(i,j)$ and $e_{CN}(i,j)$ is smaller. Because of
this, these prediction methods perform better on the probe sets generated
by PR, FS and MHRW, whereas it is correspondingly hard for them to uncover
the links with lower $e_{pop}(i,j)$ or $e_{CN}(i,j)$ in the probe sets
obtained through BFS and FF.

From the above experiments we can also identify the proper prediction
method for each sampling approach. In fact, as shown in Table 2, for each
sampling method there exist several best prediction methods. As can be
seen, for PR, MHRW and FS, RA performs best on nearly all the data sets,
which is consistent with the evaluation from randomly selected probe
sets [9]. However, for BFS and FF, SAI performs better than the other
prediction measures. It is also interesting that PA performs poorly on all
the data sets and for all the sampling methods, especially for BFS and FS.

Figure 1: Performance of different prediction measures on $E^T$ generated
by different sampling methods.

Figure 2: The distribution of $e_{pop}(i,j)$ in the probe set. For each
randomly selected data set, we obtain 100 probe sets through each sampling
method and take the averaged distribution.

Figure 3: The distribution of $e_{CN}(i,j)$ in the probe set. For each
randomly selected data set, we obtain 100 probe sets through each sampling
method and take the averaged distribution.

Figure 4: Evaluation of performance as $m$ varies for FS. For each $m$, we
obtain 100 probe sets and take the averaged AUC as the final performance of
each prediction measure.

Table 2: Best prediction measures for each sampling method.

Data Set      BFS      MHRW    FS              FF            PR
Netscience    JI       RA      RA              SAI           RA
Power         All^a    HPI     CN/AA/RA/HPI    SAI/JI/SPI    RA
USAir         SAI      RA      RA              SAI           RA
Yeast         JI       AA      RA              JI/SAI        RA
Dimes         RA       RA      RA              RA            RA
Pb            JI       RA      RA              JI/SPI        RA
Caltech       SAI      RA      RA              SAI           RA
Email         SAI      RA      RA              SAI           AA
Hepph         SAI      RA      RA              SAI           RA

a. All measures perform similarly except for PA.

Figure 5: Evaluation of performance as $p_f$ varies for FF. For each $p_f$,
we obtain 100 probe sets and take the averaged AUC as the final performance
of each prediction measure.

5.2. Tuning sampling parameters

In this subsection, we tune the sampling parameters of both FS and FF
with $s_f = 0.90$ to observe variations of the performance. Generally
speaking, for FS, a large $m$ is favorable for obtaining edges randomly
from the network. However, in the real world, it is hard to implement the
random selection of a large number of seeds. Hence, in the following
experiments, we only tune $m$ from 100 to $\min\{1000, |V|\}$. With respect
to FF, according to Eq. (17), a lower $p_f$ means that fewer neighbors
would be burned at each sampling step. In the following experiments, $p_f$
grows from 0.2 to 0.8. We show the evaluation results of tuning $m$ and
$p_f$ on three randomly selected data sets in Figure 4 and Figure 5,
respectively. As shown in Figure 4, as $m$ grows, the performance of all
the link prediction measures increases, and it also begins to saturate
quickly. However, as compared to PR, these measures still perform poorly
even for the case of $m = \min\{1000, |V|\}$. Similarly for FF, as can be
seen in Figure 5, when $p_f$ decreases from 0.8 to 0.2, the AUC of all the
prediction methods increases gradually. In particular, for the data set
USAir shown in Figure 5(a), when $p_f = 0.2$, the AUC of SAI, JI, SPI and
LHN even exceeds the value corresponding to the case of PR; the gap,
however, is small and trivial. For the other data sets shown here, the
performance of these measures is still weaker than when the probe set is
determined by PR. It is also worth noting that as $m$ or $p_f$ varies, the
performance of PA fluctuates significantly and it tends to perform better
when the sampling method is PR.

In summary, as the sampling parameters vary, the AUC obtained from FF
and FS increases to a value close to, or slightly higher than, the one from
PR. This is still consistent with our conjecture that dividing the probe
set and the training set by PR would overestimate the performance of
prediction measures. In practice, a large $m$ is difficult to satisfy,
while a smaller $p_f$ would make FF much more time-consuming.

6. Conclusion 

For a large-scale complex network in the real world, we can only sample
an incomplete picture of it. Because of this, link prediction methods have
been employed to uncover the missing links in recent years. However, in
previous works, evaluations of these methods are usually based on dividing
the known edges randomly into two parts, without considering that the pure
random partition is even impractical in the real world. For this reason, in
this paper, we try to reevaluate the performance of the local information
based link prediction measures from the view of several sampling
approaches that are widely utilized in reality. After experiments on
nine real-world data sets, we find that each of the ten prediction measures
performs unevenly under different sampling methods. In particular, for the
conventional means, i.e., pure random sampling, these measures tend to
perform best as compared with the other sampling approaches. This indicates
that in prior work, the performance of link prediction might have been
overestimated. Finally, we expect that our findings could offer a closer
look at the performance of the local information based prediction measures
and also shed light on the problem of how to select a proper prediction
method for a snapshot obtained through a certain sampling approach.